The project report is presented as an HTML document, mainly to provide an improved reader experience, and a greater degree of interactivity.
See the Project Outcome section. Be sure to read the Project Plan first though, to obtain a background understanding of the data.
Visualisations are presented twice: once within the code, and again at the end, in the discussion of the Project Outcome. User-friendly annotations and chart descriptions are available in the latter section.
Important visualisations are linked to from this document, and these visualisations can be found in the ./Report Output directory.
./config contains some Kepler chart configuration files that would've made this document even longer had they been included.
./Intermediate Output contains a set of OSRM routes, so that the user doesn't need to spend too much time querying OSRM.
./ReportStills contains some images used in the report.
The code can be rerun using the Bicycle Sharing in London.ipynb notebook. It tries its best to recreate the required directories and download the required data files. A series of pip install commands is also included to aid with the required package installs.
The script contains functionality to fetch the required data files from the TfL Open Data portal. Data has not been included, in order to reduce the submission size (Gradescope automatically decompresses zip files and takes a long time to upload). Running the code will download and process all required data files automatically.
London's TfL Cycle Hire Scheme, sponsored by Santander, has been operating since 2010 as a bicycle rental network. Tourists and residents alike have found the scheme to be one of the most cost-effective means of transportation in the city.
Every week, TfL publishes a dataset detailing cycle journeys undertaken on the network. It contains information about the bicycle used, the duration of the trip, as well as the start- and end station where the bicycle was rented from and returned to. With this dataset and the locations of every station, bicycle routes can be traced between them to gain an understanding of how these bicycles are used on the network.

The dataset consists of two sources. The first dataset (A) is taken from TfL’s Open Data website, and contains a row for every TfL Cycle Hire journey. Each journey is defined by the following properties:
- Rental Id: a unique identifier for the rental. There is only one journey per row, so every Rental Id is unique.
- Duration: the duration of the rental in seconds, to the nearest 10 seconds.
- StartStation Id: the unique identifier for the bicycle rental station where the journey began.
- EndStation Id: the unique identifier for the rental station where the bicycle was left after the journey ended.
- Bicycle Id: an identifier for the bicycle that was involved in the rental.

In addition to the Cycle Journey Data, a further dataset is incorporated providing details on the cycle docking stations (B). This dataset was not available on TfL’s Open Data website, but was linked in a Freedom of Information Request to Transport for London. It includes the following features:

- Docking Station (Name): the natural name description for the station.
- Docking Station Id: a unique identifier that joins onto the Cycle Journey data.
- Docking Points (Count): the number of docking points at that station. Bicycles on the network need to be docked at a docking point.
- Latitude and Longitude: the station’s geographic coordinates in the WGS84 projection.

From the above, a cycle journey may be defined as:
A journey is defined by a particular bicycle (with a unique Bicycle Id) rented from one docking station on the network (StartStation Id), and returned to another (EndStation Id). The journey starts at a particular date (StartDate), and lasts a certain Duration, measured to the nearest 10 seconds. It is important to note that GPS traces from the individual bicycles are not available, and thus the exact routes taken are unknown.
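Putting the definition together, a single journey record might look like this (the values are illustrative, not taken from an actual extract; the station ids correspond to River Street, Clerkenwell and St. Chad's Street, King's Cross):

```python
# A hypothetical journey record assembled from the fields described above.
journey = {
    'rental_id': 112348920,        # unique per row
    'duration_s': 780,             # rental duration, to the nearest 10 seconds
    'bike_id': 14552,
    'start_station_id': 1,         # River Street, Clerkenwell
    'end_station_id': 4,           # St. Chad's Street, King's Cross
    'start_date': '2022-06-01 08:10:00',
}
```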
Although this is a rich dataset to work with, it contains certain flaws.
Time Consistency. The statistics have been published on a regular basis since 2012. Over this time period, some column names have changed in certain files, and station identifiers seem to have shifted.
Station Identifiers. An important component of the analysis is the location of the docking stations - where bicycles are rented from and returned to. The unique identifiers for these stations do not seem to be well-defined. In the Station Data provided by TfL in the FOI response, there are numerous duplicate identifiers - that is, stations in different locations with the same identifiers.
This project aims to gain insight into how travellers use the TfL cycle network, as well as how they move throughout the city. The report will investigate whether cycle hire serves a functional commuter service, and how commuter use compares to recreational use on the network.
Write a script that can collect cycle hire data in an automated way from the TfL Open Cycling Data portal, and prepare it for analysis. The script should be capable of running in a scheduled manner to append data to the existing local store for further analysis.
Although this particular analysis will focus on a smaller, more recent period, the script should allow queries for data from any period existing in the data store. This would aid future analysis (such as allowing comparison over different years of data).
This objective also includes a data cleaning and validation requirement. The data should be formatted such that it would be quick and efficient to load (despite the volume), and validation should be performed to ensure that the data is trustworthy.
Lastly, the base data will be enriched with assumed route information. Adding routes will allow rough distance and speed calculations, and aid in understanding the movement of bicycles through the network.
The next objective aims to understand how the cycle hire scheme is used.
The section should start with a high-level analysis, looking at journey duration, distance and speed.
Bicycle journeys throughout the network will be visualised in an effort to understand movement patterns and popular routes.
A temporal analysis will be included to understand how commuter use compares to recreational use.
The ideal bicycle sharing network is self-balancing. That is, for every bicycle rented at a particular station, one is returned. This requires minimal logistical effort on the part of the network administrator.
The report will investigate the extent to which the cycle hire network is self-balancing, and how usage patterns affect the balance of the network.
A visualisation of the offending stations will be presented, hopefully providing the reader with an intuitive understanding of why imbalance might exist in the network.
The architecture used in this analysis consists of five main stages:

There are a few core processing modules and algorithms used in the analysis. Considering each phase in turn:
Data Loading Script
- HTML Parsing: Using BeautifulSoup, the data load script parses the TfL Open Cycling Data portal’s HTML page for links to Journey Extract CSV files.
- Network Requests: The Requests library is used to download CSV extract files.
Data Enrichment
- Routing: The Open Source Routing Machine (OSRM) is queried (using Requests, via an open API) to obtain route geometries between stations.
Journey Analysis
- Geospatial Analysis: GeoPandas is used to infer the route travelled, and allows the calculation of journey distances.
Visualisation
- Charting is done using Matplotlib (usually via its extension onto Pandas). KDE plots are used to illustrate the distribution of variables, and line charts to show feature changes over time.
- Geospatial Plotting uses Plotly (with a Mapbox basemap) and KeplerGL. KeplerGL was found to be more performant as it uses WebGL for graphics operations; however, it requires more setup than Plotly. KeplerGL also allows chart types that are not supported by Plotly, such as plotting GPS trip traces over time.
!pip install beautifulsoup4
!pip install geojson
!pip install geopandas
!pip install keplergl
!pip install plotly
!pip install polyline
!pip install pyarrow
!pip install shapely
!pip install seaborn
!pip install scipy
!pip install tqdm
BASE_DIR = './'
from bs4 import BeautifulSoup
from datetime import time
from geojson import LineString, Feature, FeatureCollection
import geopandas as gpd
import json
import keplergl
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import pickle
import plotly.express as px
import polyline
import requests
from shapely import geometry as shapely_geometry
from tqdm import tqdm
px.set_mapbox_access_token('pk.eyJ1Ijoicm1vc3RlcnQiLCJhIjoiY2tiZ3JjdWpyMG9pOTJ5bXptazQxYzY3dyJ9.UxiB2rAg3oBNH0yx0L9LcA')
- BeautifulSoup is used to parse the TfL Cycling Data page and extract all CSV file links from it.
- DateTime is used for date and time operations within the script.
- GeoJSON provides functionality for compiling GeoJSON files.
- GeoPandas is a package based on Pandas, providing additional functionality for geospatial operations.
- JSON helps read and write files in JSON format.
- KeplerGL is an efficient geospatial plotting tool, which will help visualise journey arcs in an interactive chart.
- NumPy has a useful set of methods to help with numerical operations on the data.
- os is used to access and manipulate files on disk. This is needed to organise all the CSV data files.
- Matplotlib provides tools for simple charting operations in the high-level analysis.
- Pandas provides a DataFrame structure to hold the data, as well as several crucial data grouping and aggregation functions.
- Pickle provides functionality to save Python objects to disk.
- Plotly is a plotting library that will be used to chart the station activity data on a map.
- Polyline is used to convert journey route geometry from OSRM into a set of coordinates.
- Requests provides functionality to download data, as well as to get responses from the public OSRM server for routing between docking stations.
- Shapely provides geometry classes, which are useful for distance calculations and interacting with GeoDataFrames.
- TQDM helps to monitor the progress of loops for longer-running operations.

DATA_DIR = BASE_DIR + 'TfL Cycling Journey Data/'
INTERIM_OUTPUT_DIR = BASE_DIR + 'Intermediate Output/'
REPORT_OUTPUT_DIR = BASE_DIR + 'Report Output/'
for DIRECTORY in [DATA_DIR, INTERIM_OUTPUT_DIR, REPORT_OUTPUT_DIR]:
if not os.path.exists(DIRECTORY):
os.mkdir(DIRECTORY)
Start by loading a saved copy of the TfL Cycling Data webpage. The page includes a list of links to CSV files made available by TfL.
The script below searches the HTML page for links to Journey Data CSV files, then sequentially downloads every required file and stores it in a local directory.
with open(DATA_DIR + 'cycling.data.tfl.gov.uk.html', 'r') as f:
cycling_data_page = f.read()
BeautifulSoup is used to parse the HTML page and extract the href attribute of every hyperlink (a tag) on the page.
soup = BeautifulSoup(cycling_data_page, 'html.parser')
# Pull out every Journey Data Extract file listed on the page:
journey_data_urls = []
for link_elements in soup.find_all('a'):
link = (link_elements.get('href'))
if 'JourneyDataExtract' in link:
journey_data_urls.append(link)
A list of URLs to CSV files containing journeys:
journey_data_urls[:5]
['https://cycling.data.tfl.gov.uk/usage-stats/01aJourneyDataExtract10Jan16-23Jan16.csv',
 'https://cycling.data.tfl.gov.uk/usage-stats/01bJourneyDataExtract24Jan16-06Feb16.csv',
 'https://cycling.data.tfl.gov.uk/usage-stats/02aJourneyDataExtract07Fe16-20Feb2016.csv',
 'https://cycling.data.tfl.gov.uk/usage-stats/02bJourneyDataExtract21Feb16-05Mar2016.csv',
 'https://cycling.data.tfl.gov.uk/usage-stats/03JourneyDataExtract06Mar2016-31Mar2016.csv']
Journey extract files for 2022:
[url for url in journey_data_urls if '2022' in url][:5]
['https://cycling.data.tfl.gov.uk/usage-stats/298JourneyDataExtract29Dec2021-04Jan2022.csv',
 'https://cycling.data.tfl.gov.uk/usage-stats/299JourneyDataExtract05Jan2022-11Jan2022.csv',
 'https://cycling.data.tfl.gov.uk/usage-stats/300JourneyDataExtract12Jan2022-18Jan2022.csv',
 'https://cycling.data.tfl.gov.uk/usage-stats/301JourneyDataExtract19Jan2022-25Jan2022.csv',
 'https://cycling.data.tfl.gov.uk/usage-stats/302JourneyDataExtract26Jan2022-01Feb2022.csv']
This script (together with the cleaning functionality below) allows parsing CSV files from any period.
But, for the purposes of this analysis, only a sample from June and July of this year will be downloaded (at the time of writing, 05 July 2022 was the latest available dataset).
NOTE: To run the analysis for any other year/month of Cycle Data, replace the year and month variables declared below. The download and validation scripts allow any period from 2015 onwards.
SELECTED_YEARS = [2022] # Filter years to be one of the years in this list
SELECTED_MONTHS = ['jun', 'jul'] # Filter months to be in this list
selected_journey_data = [
url for url in journey_data_urls
if any([str(selected_year) in url for selected_year in SELECTED_YEARS])
and any([selected_month.lower() in url.lower() for selected_month in SELECTED_MONTHS])
]
selected_journey_data
['https://cycling.data.tfl.gov.uk/usage-stats/320JourneyDataExtract01Jun2022-07Jun2022.csv',
 'https://cycling.data.tfl.gov.uk/usage-stats/321JourneyDataExtract08Jun2022-14Jun2022.csv',
 'https://cycling.data.tfl.gov.uk/usage-stats/322JourneyDataExtract15Jun2022-21Jun2022.csv',
 'https://cycling.data.tfl.gov.uk/usage-stats/323JourneyDataExtract22Jun2022-28Jun2022.csv',
 'https://cycling.data.tfl.gov.uk/usage-stats/324JourneyDataExtract29Jun2022-05Jul2022.csv']
All Cycle Journey Data Extract files are downloaded and stored in a folder on the local machine:
for journey_data_url in tqdm(selected_journey_data):
fname = journey_data_url.split('/')[-1]
path = DATA_DIR + '/' + fname
if os.path.exists(path):
continue
r = requests.get(journey_data_url)
with open(path, 'w') as f:
f.write(r.text)
100%|██████████████████████████████████████████| 5/5 [00:00<00:00, 24995.85it/s]
The Journey Data above doesn't provide the coordinates for the docking stations, which are crucial for our analysis.
A dataset containing the locations for every station was therefore retrieved from a Freedom of Information request in 2021 (FOI-0547-2122/GH). See: https://www.whatdotheyknow.com/request/tfl_santander_bicycle_sharing_sc_3
FOI_TFL_CYCLE_STATION_DATA_URL = 'https://www.whatdotheyknow.com/request/766040/response/1846134/attach/4/FOI%200689%202122.csv.txt'
foir_station_df = pd.read_csv(FOI_TFL_CYCLE_STATION_DATA_URL)
The separate journey CSV files contain columns that do not always match up. Column names should be consistently formatted across the datasets.
# {`Original Column Name`: `Desired Column Name in Output`}
column_name_mappings = {
'rentalid': 'rental_id',
'startdate': 'start_date',
'enddate': 'end_date',
'bikeid': 'bike_id',
'duration': 'duration_s',
'duration_seconds': 'duration_s',
'startstationid': 'start_station_id',
'endstationid': 'end_station_id',
'endstationname': 'end_station_name',
'startstationname': 'start_station_name',
}
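To see the normalisation end-to-end: a raw header such as StartStation Id is stripped of spaces, lower-cased, and then renamed via the mapping above. This is a minimal sketch; normalise_column is an illustrative helper, not part of the actual loading script (which uses the equivalent pandas operations shown below).

```python
# A subset of the mapping defined above.
column_name_mappings = {
    'rentalid': 'rental_id',
    'duration': 'duration_s',
    'startstationid': 'start_station_id',
    'endstationid': 'end_station_id',
}

def normalise_column(raw: str) -> str:
    """Strip spaces, lower-case, then apply the rename mapping (unknown names pass through)."""
    key = raw.replace(' ', '').strip().lower()
    return column_name_mappings.get(key, key)

normalise_column('StartStation Id')  # → 'start_station_id'
normalise_column('Rental Id')        # → 'rental_id'
```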
Progressively inspect every CSV file and ensure that it has the required columns. Apply optimisations to reduce the size of the final data files.
# To validate station names and identifiers, they are logged in sets.
# They get pickled in the procedure.
if not os.path.exists(INTERIM_OUTPUT_DIR + 'unique_station_identifiers.pickle'):
unique_station_identifiers = set()
else:
with open(INTERIM_OUTPUT_DIR + 'unique_station_identifiers.pickle', 'rb') as f:
unique_station_identifiers = pickle.load(f)
for csv_path in tqdm(os.listdir(DATA_DIR)):
feather_path = csv_path.replace('.csv', '.feather')
# Check whether we've already processed this csv file:
if feather_path in os.listdir(DATA_DIR):
continue
if '.csv' in csv_path:
journey_extract_df = pd.read_csv(DATA_DIR + '/' + csv_path)
# Sometimes TfL might write StartStation ID, and in other files they write Start Station Id.
# Let's just remove all spaces and convert to lower case, to be as agnostic as possible to future changes.
journey_extract_df.columns = [col.replace(' ', '').strip().lower() for col in journey_extract_df.columns]
journey_extract_df.rename(column_name_mappings, axis='columns', inplace=True)
# There are a few essential columns that are required for every journey.
essential_numeric_columns = ['bike_id', 'end_station_id', 'start_station_id', 'duration_s']
# If all the essential columns are not available, ignore the csv file.
if not all(col in journey_extract_df.columns for col in essential_numeric_columns):
continue
journey_extract_df.dropna(subset=essential_numeric_columns, inplace=True)
journey_extract_df.reset_index(drop=True, inplace=True)
# Reduce memory use by storing smaller integers as smaller types:
for colname in essential_numeric_columns:
journey_extract_df[colname] = journey_extract_df[colname].astype('int16')
# Convert dates to proper formats:
journey_extract_df['start_date'] = pd.to_datetime(journey_extract_df['start_date'], dayfirst=True)
journey_extract_df['end_date'] = pd.to_datetime(journey_extract_df['end_date'], dayfirst=True)
# Remove any weird journeys with impossible dates
journey_extract_df = journey_extract_df[journey_extract_df['start_date'] < journey_extract_df['end_date']]
journey_extract_df['duration_s'] = journey_extract_df['duration_s'].abs()
# Next, journeys are stored to the data folder.
# Unnecessary / repeating columns are excluded (like station names).
journey_extract_df[
essential_numeric_columns +
['rental_id', 'end_date', 'start_date']
].to_feather(DATA_DIR + '/' + feather_path)
# Validate the stations in the journey data against those in the
# docking station dataset
for station_id_name_pair in list(journey_extract_df.groupby(['start_station_id', 'start_station_name']).groups):
unique_station_identifiers.add(station_id_name_pair)
for station_id_name_pair in list(journey_extract_df.groupby(['end_station_id', 'end_station_name']).groups):
unique_station_identifiers.add(station_id_name_pair)
# Pickle the sets to serve as a backup if the script is interrupted.
with open(INTERIM_OUTPUT_DIR + 'unique_station_identifiers.pickle', 'wb') as f:
pickle.dump(unique_station_identifiers, f)
100%|████████████████████████████████████████| 12/12 [00:00<00:00, 18957.31it/s]
One of the most important features for the purposes of our analysis is the station id, which needs to be linked to the Docking Station dataset from the Freedom of Information request. So it's worth checking that the identifiers are all unique and point to the same docking stations.
Start by loading all the identifiers, together with their station names, into a DataFrame and counting each id's occurrences to identify duplicate identifiers.
journeys_station_df = pd.DataFrame(
unique_station_identifiers,
columns = ['station_id', 'station_name']
)
journeys_station_df = journeys_station_df.join(
journeys_station_df.station_id.value_counts().rename('occurrence_count'),
on='station_id'
)
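The value_counts-join pattern above can be seen on a toy example (hypothetical ids and names, assumed for illustration only):

```python
import pandas as pd

# Hypothetical station data: id 2 appears twice under two different names.
df = pd.DataFrame({
    'station_id':   [1, 2, 2, 3],
    'station_name': ['Alpha Road', 'Beta Street', 'Beta Street East', 'Gamma Square'],
})

# Count each id's occurrences and attach the count to every row carrying that id.
df = df.join(df.station_id.value_counts().rename('occurrence_count'), on='station_id')

df[df.occurrence_count > 1]  # the two rows sharing station_id 2
```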
Attempt string matching on the station names. The names and areas of the duplicate stations should roughly match; that way one can be relatively confident that the id points to roughly the same docking station (or, at the very least, the same neighbourhood).
journeys_station_df['station_first_name'] = journeys_station_df.station_name.str.split(',').str[0].str.replace(r'\d+', '', regex=True).str.strip().str.lower()
journeys_station_df['station_area'] = journeys_station_df.station_name.str.split(',').str[1].str.replace(r'\d+', '', regex=True).str.strip().str.lower()
journeydata_problematic_station_ids = []
for station_id, group in journeys_station_df[
journeys_station_df.occurrence_count > 1
].groupby('station_id'):
if not (
len(group.station_first_name.unique()) == 1
or len(group.station_area.unique()) == 1
):
journeydata_problematic_station_ids.append(station_id)
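The name-splitting logic above, applied to a sample station name from the FOI data (split_station_name is an illustrative helper mirroring the pandas string operations; it is not part of the actual script):

```python
import re

def split_station_name(name: str):
    """Mirror the pandas pipeline: split on ',', drop digits, strip, lower-case."""
    parts = name.split(',')
    first = re.sub(r'\d+', '', parts[0]).strip().lower()
    area = re.sub(r'\d+', '', parts[1]).strip().lower() if len(parts) > 1 else None
    return first, area

split_station_name('River Street, Clerkenwell')  # → ('river street', 'clerkenwell')
```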
Check whether any station ids are problematic, i.e. not uniquely defined over the time period of the dataset.
NOTE: This analysis considers a narrow slice of the data (June and July 2022), however, this script allows analysis for any period. Although no duplicates are found for this period, duplicates are quite common in earlier Cycle Journey Extract files.
len(journeydata_problematic_station_ids)
0
Inspect the problematic station ids, if any (they are excluded later in the cleaning step):
journeys_station_df[
journeys_station_df.station_id.isin(journeydata_problematic_station_ids)
].sort_values('station_id')
| station_id | station_name | occurrence_count | station_first_name | station_area |
|---|---|---|---|---|
Rename the columns, convert the coordinate values into floats, and match the datatypes of the identifiers to those used in the Cycle Journey Extract data.
# Rename the columns to align with Python convention.
# Add an `foir_` prefix to feature names that exist in the `Cycle Journey Extract` dataset.
foir_station_df.columns = ['go_live', 'foir_docking_station_name', 'foir_docking_station_id',
'n_docking_points', 'latitude', 'longitude']
foir_station_df['latitude'] = foir_station_df['latitude'].astype(float)
foir_station_df['longitude'] = foir_station_df['longitude'].astype(float)
foir_station_df['foir_docking_station_id'] = foir_station_df['foir_docking_station_id'].astype('int16')
foir_station_df['n_docking_points'] = foir_station_df['n_docking_points'].astype('int16')
foir_station_df
| go_live | foir_docking_station_name | foir_docking_station_id | n_docking_points | latitude | longitude | |
|---|---|---|---|---|---|---|
| 0 | Jul-10 | River Street, Clerkenwell | 1 | 19 | 51.529200 | -0.109971 |
| 1 | Jul-10 | Phillimore Gardens, Kensington | 2 | 37 | 51.499600 | -0.197574 |
| 2 | Jul-10 | Christopher Street, Liverpool Street | 3 | 32 | 51.521300 | -0.084606 |
| 3 | Jul-10 | St. Chad's Street, King's Cross | 4 | 23 | 51.530100 | -0.120974 |
| 4 | Jul-10 | Sedding Street, Sloane Square | 5 | 27 | 51.493100 | -0.156876 |
| ... | ... | ... | ... | ... | ... | ... |
| 832 | Dec-20 | Canada Water Station, Rotherhithe | 844 | 35 | 51.498439 | -0.049150 |
| 833 | Jan-21 | Clapham Common Station, Clapham Common | 355 | 21 | 51.461388 | -0.139221 |
| 834 | Jan-21 | Gauden Road, Clapham | 808 | 28 | 51.464995 | -0.130909 |
| 835 | Jan-21 | Bermondsey Station, Bermondsey | 845 | 30 | 51.497995 | -0.063471 |
| 836 | Apr-21 | London Fields, Hackney Central | 614 | 21 | 51.541190 | -0.058826 |
837 rows × 6 columns
Even this dataset contains some duplicate station ids. Only duplicate stations that point to roughly the same coordinates can remain in the dataset:
# Count the occurrence of every station_id and include it as a column:
foir_station_df = foir_station_df.join(
foir_station_df.foir_docking_station_id.value_counts().rename('id_occurrence_count'),
on='foir_docking_station_id'
)
foir_problematic_station_ids = []
# Iterate through groups of docking stations with the same `foir_docking_station_id`
for foir_docking_station_id, group in foir_station_df[
foir_station_df.id_occurrence_count > 1
].groupby('foir_docking_station_id'):
# If they do not all have roughly the same coordinates, mark the station id as problematic.
if not(len(group.latitude.round(3).unique()) == 1 and len(group.longitude.round(3).unique()) == 1):
foir_problematic_station_ids.append(foir_docking_station_id)
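The rounding check above tolerates coordinate differences of roughly 100 m (0.001° of latitude ≈ 111 m). Applied to the two stations that share id 355 (Oval Way and Clapham Common Station, coordinates taken from the FOI data):

```python
# Coordinates of the two stations sharing foir_docking_station_id 355:
oval_way       = (51.486600, -0.117286)
clapham_common = (51.461388, -0.139221)

same_location = (
    round(oval_way[0], 3) == round(clapham_common[0], 3)
    and round(oval_way[1], 3) == round(clapham_common[1], 3)
)
same_location  # → False: the id points to two genuinely different places
```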
There are a number of problematic stations in the FOI dataset. These stations all have the same identifier, but point to very different locations:
foir_station_df[
foir_station_df.foir_docking_station_id.isin(foir_problematic_station_ids)
].sort_values('foir_docking_station_id')
| go_live | foir_docking_station_name | foir_docking_station_id | n_docking_points | latitude | longitude | id_occurrence_count | |
|---|---|---|---|---|---|---|---|
| 342 | Oct-10 | Oval Way, Vauxhall | 355 | 21 | 51.486600 | -0.117286 | 2 |
| 833 | Jan-21 | Clapham Common Station, Clapham Common | 355 | 21 | 51.461388 | -0.139221 | 2 |
| 459 | Mar-12 | Castalia Square, Cubitt Town | 474 | 39 | 51.498100 | -0.011457 | 2 |
| 831 | Dec-20 | Gascoyne Road, Victoria Park | 474 | 20 | 51.541515 | -0.038558 | 2 |
| 467 | Mar-12 | Thornfield House, Poplar | 482 | 26 | 51.509800 | -0.023770 | 2 |
| 824 | Jul-20 | Exhibition Road Museums 2, South Kensington | 482 | 14 | 51.499900 | -0.174554 | 2 |
| 574 | Oct-13 | Bradmead, Nine Elms | 614 | 40 | 51.478200 | -0.144691 | 2 |
| 836 | Apr-21 | London Fields, Hackney Central | 614 | 21 | 51.541190 | -0.058826 | 2 |
| 775 | Feb-16 | Bevington Road, North Kensington | 805 | 27 | 51.520074 | -0.206338 | 2 |
| 822 | Jun-20 | The Metropolitan, Portobello | 805 | 18 | 51.520509 | -0.200805 | 2 |
| 768 | Dec-15 | Stockwell Roundabout, Stockwell | 808 | 33 | 51.473500 | -0.122556 | 2 |
| 834 | Jan-21 | Gauden Road, Clapham | 808 | 28 | 51.464995 | -0.130909 | 2 |
This dataset will be joined onto the journey dataset. Therefore, it's also worth flagging any stations that do not match between the two datasets.
full_station_df = foir_station_df.merge(
journeys_station_df,
left_on='foir_docking_station_id',
right_on='station_id',
how='outer'
)
Stations are named as (station_name, area). The station name and area fields are compared separately. If both the station name and the area do not match up, the docking_station_id is flagged as problematic.
full_station_df['foir_station_first_name'] = (
full_station_df.foir_docking_station_name.str.split(',').str[0].str.replace(r'\d+', '', regex=True).str.strip().str.lower()
)
full_station_df['foir_station_area'] = (
full_station_df.foir_docking_station_name.str.split(',').str[1].str.replace(r'\d+', '', regex=True).str.strip().str.lower()
)
non_matching_ids = full_station_df[(
(full_station_df.foir_station_first_name != full_station_df.station_first_name) &
(full_station_df.foir_station_area != full_station_df.station_area)
)].foir_docking_station_id.unique()
The full set of station ids to exclude:
# Take the union of the three sets. Note: chaining sets with `or` would return
# only the first non-empty set, not the union.
all_problematic_station_ids = (
    set(journeydata_problematic_station_ids)
    | set(foir_problematic_station_ids)
    | set(non_matching_ids)
)
print(f"Exclude {len(all_problematic_station_ids)} docking station ids from the analysis.")
Exclude 6 docking station ids from the analysis.
Prepare the FOI station dataframe: drop the problematic ids and keep the most recent entry per id.
foir_station_df = foir_station_df[
(~foir_station_df.foir_docking_station_id.isin(all_problematic_station_ids))
]
foir_station_df = foir_station_df.sort_values(
'go_live',
ascending=False
).groupby('foir_docking_station_id').first()
foir_station_df.drop(
['go_live', 'id_occurrence_count'],
axis='columns',
inplace=True
)
# Remove the foir column name prefix that was added earlier.
foir_station_df.columns = [col.replace('foir_', '') for col in foir_station_df.columns]
Plotting the stations that are available:
def plot_docking_stations(docking_station_df, zoom=8.5, title=None):
return px.scatter_mapbox(
docking_station_df.reset_index(),
lat='latitude',
lon='longitude',
size='n_docking_points',
color='n_docking_points',
opacity=0.7,
zoom=zoom,
hover_data=['docking_station_name', 'foir_docking_station_id'],
title=title
)
plot_docking_stations(foir_station_df)
From the above chart, it is clear that there is one station that is located significantly further South than the rest of the docking stations.
By hovering over the marker, it displays a docking_station_id of 502 and is listed as South Quay West, Canary Wharf.
This is clearly an error, as that location is nowhere near Canary Wharf. Fix the station coordinates by setting them equal to the coordinates of South Quay East, Canary Wharf (id 494). For the purposes of this analysis, a rough location is good enough.
foir_station_df.loc[502, ['latitude', 'longitude']] = foir_station_df.loc[494, ['latitude', 'longitude']]
plot_docking_stations(foir_station_df)
Save the dataset to the local disk for easy reference in the next phase of the analysis.
foir_station_df.reset_index().to_feather(INTERIM_OUTPUT_DIR + 'FOI Station Data Cleaned.feather')
In this phase, the Open Source Routing Machine (OSRM) open API is used to find a reasonable bike route between stations. This will allow inference on distance, as well as an understanding of cycle movement in London.
SIMPLIFYING ASSUMPTION: This report attempts route analysis in the absence of GPS trace data. The very strong assumption is therefore made that every cycle trip with the same origin and destination coordinates takes the same OSRM route through the city.
Reload the cleaned station data, along with the journey data:
foir_station_df = pd.read_feather(INTERIM_OUTPUT_DIR + 'FOI Station Data Cleaned.feather').set_index('foir_docking_station_id')
journey_df = pd.concat([
pd.read_feather(DATA_DIR + '/' + fname)
for fname in os.listdir(DATA_DIR)
if '.feather' in fname
])
print(f"{len(journey_df)} cycle journeys and {len(foir_station_df)} docking stations have been loaded.")
1475810 cycle journeys and 797 docking stations have been loaded.
With 837 unique stations, there are 837 × 836 (~700k) ordered combinations of start and end points that would theoretically require routes. Sending ~700k requests to the open OSRM server would take an immense amount of time to process. As an alternative, only the popular routes that make up the majority of journeys are queried.
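The count above is just ordered pairs: with n stations there are n × (n − 1) possible directed station-to-station journeys (same-station round trips excluded):

```python
n_stations = 837
n_ordered_pairs = n_stations * (n_stations - 1)
n_ordered_pairs  # → 699732, i.e. roughly 700k candidate routes
```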
# The station data frame contains a row for every cycle hire station point on the TfL cycle hire network.
# Join this onto the journey data frame, and count the most popular journeys taken on the network.
def get_journey_counts_from_journey_data(df):
"""
Count the unique journeys in a dataframe containing journey row data, and return the output.
:param df: A DataFrame containing journey data. Requires the columns `start_station_id` and `end_station_id`
:return:
"""
global foir_station_df
journey_count_df = df[['start_station_id', 'end_station_id']].value_counts().to_frame()
journey_count_df.columns = ['Journey Count']
journey_count_df.reset_index(inplace=True)
# Exclude journeys that start and end at the same station:
journey_count_df = journey_count_df[journey_count_df['start_station_id'] != journey_count_df['end_station_id']]
# Join onto the station data:
journey_count_df = journey_count_df.join(
foir_station_df,
on='start_station_id'
).join(
foir_station_df,
on='end_station_id',
rsuffix='_destination'
).dropna(
subset=['docking_station_name', 'docking_station_name_destination']
)
return journey_count_df
journey_counts = get_journey_counts_from_journey_data(journey_df)
Select a subsample of the most popular journeys to find route information for:
journey_count_sample = journey_counts[
journey_counts['Journey Count'] > journey_counts['Journey Count'].quantile(0.9)
]
Next, the most popular journeys are selected and routes are obtained for these journeys.
In order to do this, the coordinates of the origin and destination stations are passed to a map routing engine: the Open Source Routing Machine's public endpoint is used to achieve this. See: https://project-osrm.org/
IMPORTANT: The code below illustrates how the route geometry dataset was created. Those only interested in the analysis can skip to the next cell, which loads the results from a locally stored Pickle file.
OSRM_BASE_URL = "https://router.project-osrm.org/route/v1"
def osrm_route_directions(
origin_lat: float, origin_lon: float,
destination_lat: float, destination_lon: float,
profile: str='bike'
):
"""
Return a linestring bicycle route between an origin and destination using the Open Source Routing Machine.
See: https://project-osrm.org/docs/v5.24.0/api/#
:param origin_lat: The latitude for the starting point's coordinates (WGS84).
:param origin_lon: The longitude for the starting point's coordinates (WGS84).
:param destination_lat: The latitude for the destination point's coordinates (WGS84).
:param destination_lon: The longitude for the destination point's coordinates (WGS84).
:param profile: The OSRM profile to route with. This can be `car`, `bike`, or `foot` (for pedestrian directions).
:return: A LineString representing the recommended OSRM route between the origin and destination.
"""
# Format the OSRM route URL and get the result from the open OSRM endpoint
coord_string = f"{origin_lon},{origin_lat};{destination_lon},{destination_lat}"
url = f"{OSRM_BASE_URL}/{profile}/{coord_string}"
routes = requests.get(url, timeout=30).json().get('routes', [])
# OSRM can return multiple routes. Take the first line, convert it to Shapely geometry and return it.
if len(routes) > 0:
geometry = polyline.decode(routes[0]['geometry'], geojson=True)
return shapely_geometry.LineString(geometry)
return None
if not os.path.exists(INTERIM_OUTPUT_DIR + 'station_pair_route_geometries.pickle'):
station_pair_route_geometries = {}
else:
with open(INTERIM_OUTPUT_DIR + 'station_pair_route_geometries.pickle', 'rb') as f:
station_pair_route_geometries = pickle.load(f)
for ix, row in tqdm(journey_count_sample.reset_index(drop=True).iterrows()):
if (row.start_station_id, row.end_station_id) in station_pair_route_geometries:
continue
station_pair_route_geometries[(row.start_station_id, row.end_station_id)] = osrm_route_directions(
row['latitude'], row['longitude'],
row['latitude_destination'], row['longitude_destination']
)
# Pickle the results periodically, to serve as a backup if the script is interrupted.
if ix % 100 == 0:
with open(INTERIM_OUTPUT_DIR + 'station_pair_route_geometries.pickle', 'wb') as f:
pickle.dump(station_pair_route_geometries, f)
21929it [00:00, 51375.01it/s]
The OSRM cycle route geometries are loaded into a GeoDataFrame to allow for easier processing.
station_id_cross_with_geometry = gpd.GeoDataFrame.from_dict(
station_pair_route_geometries, orient='index', columns=['geometry'], geometry='geometry', crs='EPSG:4326'
)
station_id_cross_with_geometry.plot()
<AxesSubplot: >
Next, the journey data is analysed and visualisations are prepared to aid in understanding how the TfL Cycle Network is used.
The charts below illustrate the duration distribution of a typical bicycle hire on the network, comparing weekday trips to weekend trips.
journey_df['is_weekend'] = (journey_df.start_date.dt.weekday > 4)
journey_df['duration_min'] = journey_df.duration_s / 60
def distribution_plot_weekend_vs_weekday(df, y, xlim):
axes = plt.gca()
for weekend_flag in [True, False]:
label = 'During the Weekend' if weekend_flag else 'During Weekdays'
axes = df[df.is_weekend == weekend_flag].plot(
y=y,
kind='kde',
xlim=xlim,
ax=axes,
label=label
)
return axes
def plot_cycle_hire_duration_distribution():
distribution_plot_weekend_vs_weekday(journey_df, 'duration_min', (0, 80))
plt.title("Distribution of Cycle Hire Duration")
plt.xlabel("Rental Duration in Minutes")
plt.show()
plot_cycle_hire_duration_distribution()
print(f"The median journey is {journey_df.duration_min.median()} minutes long.")
The median journey is 15.0 minutes long.
prop_journeys_under_30_min = ((journey_df.duration_min < 30).sum() / len(journey_df)).round(2)
print(f"{100 * prop_journeys_under_30_min}% of journeys are less than 30 minutes.")
86.0% of journeys are less than 30 minutes.
From the rental duration distribution above, most cycle journeys last between 5 and 20 minutes, both during the weekend and during the week. There is also a sharp drop-off towards the 30-minute mark.
From 2015 to October 2022, the TfL Cycle Hire scheme charged £2 per day for a cycle, which allowed 30 minutes of use. If the cycle was not re-docked within the 30-minute time limit, an additional £2 would be charged for every additional 30-minute period (or part thereof).
The pricing scheme could therefore explain the steeper drop-off between 20 and 30 minutes (compared to 10min-20min).
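As a rough sketch, the fee structure described above can be expressed as a function of hire duration. This is a simplification (it bills each journey independently and ignores the daily-access nature of the £2 fee), but it makes the incentive behind the 30-minute drop-off concrete:

```python
import math

def hire_cost_gbp(duration_min: float) -> int:
    """Approximate fee for a single hire under the 2015-Oct 2022 tariff:
    2 GBP covers the first 30 minutes; every additional 30-minute period
    (or part thereof) adds another 2 GBP."""
    extra_periods = math.ceil(max(0.0, duration_min - 30) / 30)
    return 2 + 2 * extra_periods
```

Under this sketch a 25-minute hire costs £2, while a 35-minute hire costs £4, consistent with the drop-off observed just before the 30-minute mark.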
Next, the routing information is used to analyse rental journey distance and speed.
Start by joining the route geometry onto the journey data:
# Route geometries are indexed by `(start_station_id, end_station_id)` tuples.
# Derive this tuple and store it as a column. Route geometry will be joined on this column.
journey_df['station_pair'] = journey_df[['start_station_id', 'end_station_id']].apply(tuple, axis=1)
journey_df = journey_df.join(
station_id_cross_with_geometry,
on='station_pair',
how='inner'
)
WGS84 (alias: EPSG:4326) is a widely-used coordinate reference system. However, Euclidean distance calculations between its coordinates do not translate to distances measured on the surface of the Earth (due to the curvature of the planet). EPSG:27700 (the British National Grid) provides a coordinate projection, measured in metres, that allows simple distance calculations between coordinates.
Therefore, coordinates are converted to EPSG:27700 to calculate the distance of the route geometries.
# Convert to a GeoDataFrame to allow for distance calculations:
journey_df = gpd.GeoDataFrame(journey_df, geometry='geometry', crs='EPSG:4326')
journey_df['distance_m'] = journey_df.to_crs('EPSG:27700').geometry.length
MILES_TO_METERS = 1609.34 # A coefficient to convert between distances in miles and meters.
journey_df['distance_mi'] = journey_df['distance_m'] / MILES_TO_METERS
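The reprojection above relies on geopandas' `to_crs`, but the underlying issue can be illustrated with a stdlib haversine sketch: at London's latitude, one degree of longitude spans far fewer metres than one degree of latitude, so raw WGS84 degrees cannot be treated as planar distances. The coordinates below are illustrative points near central London, not taken from the station data:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 coordinates."""
    r = 6_371_000  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# One degree of longitude vs one degree of latitude, starting near central London:
d_lon = haversine_m(51.5074, -0.1278, 51.5074, -1.1278)  # roughly 69 km
d_lat = haversine_m(51.5074, -0.1278, 52.5074, -0.1278)  # roughly 111 km
```

Both pairs of points differ by exactly one degree, yet their metric distances differ by roughly 40%; projecting to EPSG:27700 sidesteps this entirely.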
def plot_cycle_hire_distance_distribution():
distribution_plot_weekend_vs_weekday(journey_df, 'distance_mi', (0, 6))
plt.title("Distribution of Cycle Journey Distance*")
plt.xlabel("Distance of the recommended route (mi)")
plt.figtext(0.1, -0.1, "*Assuming bicycles take the OSRM-recommended route between two stations.")
plt.show()
plot_cycle_hire_distance_distribution()
print(f"The median journey is {round(journey_df.distance_mi.median(), 2)} miles long")
The median journey is 1.56 miles long
The reader is reminded of the strong assumption made by this chart: namely, that every journey is assumed to take the OSRM-recommended route between the start and end stations.
The distances above are therefore roughly a lower bound on the distance travelled during each trip, as users may well have made detours (especially during the weekend).
The distribution does, however, show that the majority of cycle journeys fall in the ½-3 mile range. What is somewhat surprising is that the distribution of cycle journey distance during the weekend is not significantly different from that during the week.
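The lower-bound intuition follows from the triangle inequality: any deviation from a path can only add distance. A toy sketch in the plane, with hypothetical coordinates:

```python
import math

def path_length(coords):
    """Total length of a piecewise-linear path through planar coordinates."""
    return sum(math.dist(a, b) for a, b in zip(coords, coords[1:]))

# A routed path with a bend, vs the straight line between its endpoints:
route = [(0.0, 0.0), (3.0, 4.0), (6.0, 0.0)]  # two 5-unit legs
straight = math.dist(route[0], route[-1])     # 6 units

# Any detour away from a route only lengthens the ride, so the routed
# distance is a lower bound on the distance actually travelled.
```

Here `path_length(route)` is 10 units against a 6-unit straight line; likewise, a rider weaving off the OSRM route can only travel further than the routed distance.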
Next, the distances calculated in the previous section can be combined with journey duration to calculate an assumed average cycle speed for the journey.
MPS_TO_MPH = 2.23694 # A coefficient to convert between meters-per-second and miles-per-hour
journey_df['speed_mph'] = MPS_TO_MPH * journey_df['distance_m'] / journey_df['duration_s']
The speed calculation can also be used to filter out possible errors in the data:
journey_df = journey_df[journey_df['speed_mph'] < 30]
def plot_cycle_journey_speed_distribution():
distribution_plot_weekend_vs_weekday(journey_df, 'speed_mph', (0, 30))
plt.title("Distribution of Average Cycle Journey Speed*")
plt.xlabel("Average Cycle Journey Speed (mph)")
plt.figtext(0.1, -0.1, "*Assuming bicycles take the OSRM-recommended route between two stations.")
plt.show()
plot_cycle_journey_speed_distribution()
Assuming that bicycles take the recommended route between stations, the majority of cycle journeys average between 5-15mph over the journey. Notably, average cycle speed has a significantly lower distribution for weekend trips than for weekday trips. Given that there is no significant difference between assumed trip distances on the weekend versus weekdays, this suggests that cyclists ride faster, and are perhaps more rushed, during the week than on the weekend.
Plot the most popular weekday routes on a map using Kepler, varying the stroke width and opacity based on the journey weight.
journey_df.is_weekend.value_counts()
False 729144 True 252459 Name: is_weekend, dtype: int64
Get journey counts between stations during the week:
journey_counts_df = get_journey_counts_from_journey_data(
journey_df[~journey_df.is_weekend]
)
Assign the unique (start_station_id, end_station_id) tuple as the index, and join onto the OSRM route geometry:
journey_counts_df.index = journey_counts_df[['start_station_id', 'end_station_id']].apply(tuple, axis=1)
journey_counts_df = journey_counts_df.join(station_id_cross_with_geometry)
Convert to a GeoDataFrame and save the result.
journey_counts_gdf = gpd.GeoDataFrame(journey_counts_df.reset_index(drop=True), geometry='geometry')
journey_counts_gdf.to_file(INTERIM_OUTPUT_DIR + 'TfL Cycle Journey Counts.geojson')
Load the chart configuration json and pass the route data to the chart:
with open('./config/route_popularity_kepler_config.json') as f:
route_popularity_kepler_config = json.load(f)
route_popularity_kepler_chart = keplergl.KeplerGl(
height=500,
data={
'p5nn3waud': journey_counts_gdf[
journey_counts_gdf['Journey Count'] > 50
]
},
config=route_popularity_kepler_config
)
User Guide: https://docs.kepler.gl/docs/keplergl-jupyter

route_popularity_kepler_chart.save_to_html(
file_name=REPORT_OUTPUT_DIR + 'Route Popularity Visualisation.html',
data={
'p5nn3waud': journey_counts_gdf[
journey_counts_gdf['Journey Count'] > 50
]
},
config=route_popularity_kepler_config,
read_only=True
)
Map saved to ./Report Output/Route Popularity Visualisation.html!
Route Popularity Chart link: ./Report Output/Route Popularity Visualisation.html
From the above chart, it's clear that the most popular weekday cycle hire journeys are in and around Hyde Park, along the banks of the Thames in central London, and around Queen Elizabeth Olympic Park.
The chart also suggests that Cycle Hire is popular among tourists and recreational users, with hires concentrated in the royal parks and around the "touristy" parts of central London.
It's not yet clear whether TfL Cycle Hire is widely used by commuters. A temporal analysis follows, to establish whether clear commuter patterns exist in the data.
def extract_journey_counts_by_station_by_period(df, start_or_end, time_group_freq='15min'):
"""
Convert a dataframe of journeys to one with journey counts, by time period and station id.
Specify whether to count journeys `start`ing or `end`ing with the `start_or_end` parameter.
:param df: A dataframe of journeys. Required columns: `start_or_end`_station_id, rental_id, start_date, `start_or_end`_date.
:param start_or_end: Either `start` or `end`, depending on whether to count journeys started, or journeys ended.
:param time_group_freq: A period over which to count the journeys. Journeys are counted by station_id, time_group cells.
:return: A DataFrame containing hire counts for every station and period combination.
"""
df['date'] = df.start_date.dt.date
# Extract the time of the rental / return, rounded to the nearest `time_group_freq` period (15 minutes by default)
df.loc[:, f'{start_or_end}_time'] = df[f'{start_or_end}_date'].dt.round(time_group_freq).dt.time
# Count returns/rentals by day, time and station id
journeys_by_station_by_day_by_period = df.groupby(
['date', f'{start_or_end}_time', f'{start_or_end}_station_id']
)[['rental_id']].count()
# Fill in days and times with no rentals with zero
journeys_by_station_by_day_by_period_pivot = pd.pivot_table(
journeys_by_station_by_day_by_period,
values='rental_id',
columns=[f'{start_or_end}_station_id'],
index=['date', f'{start_or_end}_time']
).fillna(0)
journeys_by_station_by_day_by_period_unpivot = journeys_by_station_by_day_by_period_pivot.reset_index().melt(
id_vars=['date', f'{start_or_end}_time'],
value_name=f'n_hires_{start_or_end}ed_per_period'
)
# Average by time period
journeys_by_station_by_period = journeys_by_station_by_day_by_period_unpivot.reset_index().groupby(
[f'{start_or_end}_time', f'{start_or_end}_station_id']
).mean(numeric_only=True)[[f'n_hires_{start_or_end}ed_per_period']]
return journeys_by_station_by_period.reset_index()
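The time binning above relies on pandas' `dt.round`; a minimal sketch of its behaviour, using two hypothetical rental start times:

```python
import pandas as pd

# Two hypothetical rental start times, binned to the nearest 15 minutes:
starts = pd.Series(pd.to_datetime(['2022-06-21 08:07:00', '2022-06-21 08:09:00']))
binned = starts.dt.round('15min')
# 08:07 rounds down to 08:00; 08:09 rounds up to 08:15.
```

Journeys starting within the same 15-minute window therefore share a bin, which is what makes the groupby counts comparable across days.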
weekday_journey_df = journey_df[~journey_df.is_weekend].reset_index(drop=True)
weekend_journey_df = journey_df[journey_df.is_weekend].reset_index(drop=True)
weekday_started_journeys_by_station_by_time = extract_journey_counts_by_station_by_period(weekday_journey_df, 'start')
weekend_started_journeys_by_station_by_time = extract_journey_counts_by_station_by_period(weekend_journey_df, 'start')
Chart the number of hires that started and ended over time:
def plot_journeys_started_weekend_vs_weekday():
weekday_started_journeys_by_station_by_time.groupby('start_time').mean()['n_hires_started_per_period'].plot(
figsize=(12, 5),
label='Weekday #hires started per period'
)
weekend_started_journeys_by_station_by_time.groupby('start_time').mean()['n_hires_started_per_period'].plot(
figsize=(12, 5),
label='Weekend #hires started per period',
)
plt.xticks([time(h*2, 0) for h in range(12)])
plt.legend()
plt.xlabel("Time")
plt.title("Average number of TfL Cycle Hire Journeys Started per 15-minute Period per Docking Station")
plt.ylabel("# Hires per Docking Station per 15-minute Period")
plt.show()
plot_journeys_started_weekend_vs_weekday()
There is a stark difference between the times at which cycles are rented during the week and over the weekend. Two clear "rush hour" peaks also appear on weekdays.
To further compare weekday- with weekend journeys, the trips will be analysed to understand where most activity is within each group.
def get_journey_start_end_station_counts(df):
"""
Count journeys for every start-/end-station pair.
:param df: A DataFrame containing journeys. Required columns: `start_station_id`, `end_station_id`, `start_date`, `end_date`, `rental_id`, `duration_s`
:return: A DataFrame with journey counts for every station pair.
"""
# Ignore journeys that start and end at the same station.
filtered_journey_df = df[
(df['start_station_id'] != df['end_station_id'])
].dropna()
filtered_journey_df['start_hour'] = filtered_journey_df['start_date'].dt.hour
filtered_journey_df['end_hour'] = filtered_journey_df['end_date'].dt.hour
filtered_journey_df['date'] = pd.to_datetime(filtered_journey_df['start_date'].dt.date)
# Count journeys by their date and the hour within which they started.
journeys_by_station_and_hour = filtered_journey_df.groupby(
['start_station_id', 'end_station_id', 'date', 'start_hour']
)[['rental_id']].count()
journeys_by_station_and_hour.reset_index(inplace=True)
# Count the number of unique dates for which trips exist for every unique station pair:
station_n_unique_dates = journeys_by_station_and_hour.groupby(
['start_station_id', 'end_station_id']
)['date'].nunique()
# Count up the number of trips within every station-pair-hour cell, over all dates:
station_total_trips_by_hour = journeys_by_station_and_hour.groupby(
['start_station_id', 'end_station_id', 'start_hour']
)[['rental_id']].sum()
# Add in the average trip duration for every station pair.
station_total_trips_by_hour = station_total_trips_by_hour.join(
filtered_journey_df.groupby(['start_station_id', 'end_station_id', 'start_hour'])[['duration_s']].mean()
)
# Join the unique date counts, to allow calculations of an average number of hires per hour:
station_total_trips_by_hour = station_total_trips_by_hour.join(
station_n_unique_dates,
how='inner'
)
station_total_trips_by_hour['Hires Started per hour'] = (
station_total_trips_by_hour['rental_id'] / station_total_trips_by_hour['date']
)
# Add a convenience feature for duration in minutes:
station_total_trips_by_hour['Duration (min)'] = station_total_trips_by_hour.duration_s / 60
# Join on the station location data.
journey_start_end_station_counts = station_total_trips_by_hour.reset_index().join(
foir_station_df,
on='start_station_id'
).join(
foir_station_df[['longitude', 'latitude']],
on='end_station_id',
rsuffix='_end'
)
return journey_start_end_station_counts
Count data is fetched both for weekend and weekday journeys:
journey_start_end_station_counts_weekend = get_journey_start_end_station_counts(journey_df[journey_df.is_weekend])
journey_start_end_station_counts_weekday = get_journey_start_end_station_counts(journey_df[~journey_df.is_weekend])
Load the Kepler chart configuration and pass the transformed data to the chart:
with open('./config/kepler_arc_config.json') as f:
kepler_arc_plot_config = json.load(f)
weekend_journey_arc_plot = keplergl.KeplerGl(
height=500,
data={'pj760hy89': journey_start_end_station_counts_weekend},
config=kepler_arc_plot_config
)
User Guide: https://docs.kepler.gl/docs/keplergl-jupyter
weekend_journey_arc_plot.save_to_html(
file_name=REPORT_OUTPUT_DIR + 'Weekend Journeys Arc Visualisation.html',
data={'pj760hy89': journey_start_end_station_counts_weekend},
config=kepler_arc_plot_config,
read_only=True
)
Map saved to ./Report Output/Weekend Journeys Arc Visualisation.html!
weekday_journey_arc_plot = keplergl.KeplerGl(
height=500,
data={'pj760hy89': journey_start_end_station_counts_weekday},
config=kepler_arc_plot_config
)
User Guide: https://docs.kepler.gl/docs/keplergl-jupyter
weekday_journey_arc_plot.save_to_html(
file_name=REPORT_OUTPUT_DIR + 'Weekday Journeys Arc Visualisation.html',
data={'pj760hy89': journey_start_end_station_counts_weekday},
config=kepler_arc_plot_config,
read_only=True
)
Map saved to ./Report Output/Weekday Journeys Arc Visualisation.html!
weekday_journey_arc_plot
KeplerGl(config={'version': 'v1', 'config': {'visState': {'filters': [{'dataId': ['pj760hy89'], 'id': 'xtkquj8…

Weekday Journey Plot link: ./Report Output/Weekday Journeys Arc Visualisation.html
During the week, the main route activity flows from Waterloo, King's Cross and Euston stations into the City Centre.
There also appear to be many longer trips within Hyde Park.
weekend_journey_arc_plot
KeplerGl(config={'version': 'v1', 'config': {'visState': {'filters': [{'dataId': ['pj760hy89'], 'id': 'xtkquj8…

Weekend Journey Plot link: ./Report Output/Weekend Journeys Arc Visualisation.html
During the weekend, the main routes of activity change significantly. Neither the commuter train stations (Waterloo, Euston and King's Cross), nor the Square Mile seems to feature. Instead, activity is based around the Thames (particularly the South Bank), Hyde Park, and some longer journeys between Camden and the South Bank.
A visualisation is made to illustrate how TfL bicycles move through London during a single day.
SIMPLIFYING ASSUMPTION: Again, the assumption is made that every cycle trip with the same origin and destination coordinates takes the same OSRM route through the city.
Take the day with the most cycle hire journeys:
journey_df['date'] = pd.to_datetime(journey_df['start_date'].dt.date)
journey_df.date.value_counts().head(1)
2022-06-21 36539 Name: date, dtype: int64
single_day_journey_df = journey_df[journey_df.date == '2022-06-21']
Kepler's trip module requires a FeatureSet, containing coordinate traces with the format:
[longitude, latitude, altitude, timestamp].
Below, the procedure iterates through every journey in the day, and then through every coordinate in the journey. A timestamp is calculated by assuming the cycle journey travels at a constant speed throughout.
# Split a linestring into its constituent segments, so per-segment distances can be calculated.
def split_line_segments(line):
return list(map(LineString, zip(line.coords[:-1], line.coords[1:])))
linestring_featureset = []
for ix, row in tqdm(single_day_journey_df.iterrows()):
# Draw out the coordinates from the linestring and calculate the distance between every successive pair.
distances = gpd.GeoSeries(
split_line_segments(row.geometry),
crs='EPSG:4326'
).to_crs('EPSG:27700').length
# Set the first coordinate as zero distance travelled:
distances = np.array([0] + distances.values.tolist())
if distances.sum() == 0:
continue
# Use the calculated distances to interpolate a number of seconds passed.
# Assume the bicycle travels at a constant speed throughout the trip.
time_passed_s = (distances / distances.sum() * row.duration_s).cumsum()
# Add the seconds passed to the start date of the trip.
timestamps = (row.start_date + pd.to_timedelta(time_passed_s, unit='s')).map(pd.Timestamp.timestamp)
# Compile the coordinates, together with the timestamps into LineString features.
route = row.geometry.coords
if not any(pd.isna(time_passed_s)):
linestring_featureset.append(
Feature(
geometry = LineString(zip(
np.array(route)[:,0],
np.array(route)[:,1],
np.zeros(len(route)),
timestamps.astype(int).values.tolist()
)),
properties = {
'id': row.rental_id,
'bike_id': row.bike_id,
}
)
)
36539it [07:14, 84.13it/s]
Convert to a FeatureCollection and save the result:
collection = FeatureCollection(linestring_featureset)
with open(INTERIM_OUTPUT_DIR + "single_day_trip_features.json", "w") as f:
f.write(str(collection))
with open('./config/trip_visualisation_kepler_config.json') as f:
trip_visualisation_kepler_config = json.load(f)
trip_plot = keplergl.KeplerGl(
height=500,
data={'5ankfi16j': dict(collection)},
config=trip_visualisation_kepler_config
)
User Guide: https://docs.kepler.gl/docs/keplergl-jupyter
trip_plot.save_to_html(
file_name=REPORT_OUTPUT_DIR + 'Single Day Trip Visualisation.html',
data={'5ankfi16j': dict(collection)},
config=trip_visualisation_kepler_config,
read_only=True
)
Map saved to ./Report Output/Single Day Trip Visualisation.html!

Link to plot: ./Report Output/Single Day Trip Visualisation.html
The trip plot illustrates a typical day of Journeys on the TfL Cycle Network. The date selected for the purpose of this analysis (21 June 2022) was a particularly busy day on the network, due to tube strikes affecting London Underground services.
Key Assumption: It's worth reminding the user that the journeys are based on the strong assumption that cycle hire users took the recommended route between stations.
In order to ensure bicycle availability at every docking station, as well as dock availability (to allow a return), the network constantly needs to be rebalanced. TfL do this by moving cycles between docking stations with a truck.
The section below analyses the need for rebalancing, and aims to understand which parts of the network might see a greater surplus of bicycles.
# Start stations translate to bicycle rentals. End stations are returns.
docking_station_rentals = journey_df.groupby(
['date', 'start_station_id']
)[['rental_id']].count().reset_index()
docking_station_returns = journey_df.groupby(
['date', 'end_station_id']
)[['rental_id']].count().reset_index()
docking_station_rentals.columns = ['date', 'foir_docking_station_id', 'n_rentals']
docking_station_returns.columns = ['date', 'foir_docking_station_id', 'n_returns']
docking_station_returns.set_index(['date', 'foir_docking_station_id'], inplace=True)
docking_station_rentals.set_index(['date', 'foir_docking_station_id'], inplace=True)
# Join the rentals and returns onto the same dataframe
docking_station_movements = docking_station_returns.join(docking_station_rentals, how='outer')
Now a daily bicycle surplus can be calculated:
docking_station_movements['daily_surplus'] = (
docking_station_movements.n_returns.fillna(0) - docking_station_movements.n_rentals.fillna(0)
)
Every bicycle is taken from one station and returned to another (journeys with missing end stations are ignored in the analysis), so the total surplus across the network should sum to zero:
docking_station_movements['daily_surplus'].sum()
0.0
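The zero-sum property can be seen on a toy example mirroring the groupby-count logic above, with two hypothetical stations A and B:

```python
import pandas as pd

# Three hypothetical journeys between two stations:
toy = pd.DataFrame({
    'rental_id': [1, 2, 3],
    'start_station_id': ['A', 'A', 'B'],  # rentals
    'end_station_id':   ['B', 'B', 'A'],  # returns
})
rentals = toy.groupby('start_station_id')['rental_id'].count()
returns = toy.groupby('end_station_id')['rental_id'].count()
surplus = returns.sub(rentals, fill_value=0)
# Station B gains one bicycle, station A loses one; network-wide the sum is zero.
```

Every rental at one station is matched by a return at another, so per-station surpluses always cancel in aggregate; only their geographic distribution matters for rebalancing.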
def plot_bicycle_surplus_distribution():
axes = plt.gca()
docking_station_movements.plot(
y='daily_surplus',
kind='kde',
xlim=(-30, 30),
ax=axes,
)
plt.title("The Distribution of Daily Bicycle Surplus by Docking Station")
plt.xlabel("Bicycle Surplus (# of Bicycles)")
plt.show()
plot_bicycle_surplus_distribution()
The distribution of the daily bicycle surplus by station is approximately normal, centred on zero.
average_daily_surplus = docking_station_movements.daily_surplus.clip(lower=0).mean()
print(f"On average, for every docking station on the network, {round(average_daily_surplus, 2)} bicycles need to be manually moved to another station at the end of the day to balance out the network.")
On average, for every docking station on the network, 3.09 bicycles need to be manually moved to another station at the end of the day to balance out the network.
Station information is joined onto the bicycle surplus calculations, to allow a geographic understanding of the rental / return imbalance on the network.
average_docking_station_movements = docking_station_movements.reset_index().groupby('foir_docking_station_id').mean(numeric_only=True)
average_docking_station_movements = average_docking_station_movements.join(foir_station_df)
average_docking_station_movements.dropna(subset=['n_docking_points'], inplace=True)
fig_average_daily_station_surplus = px.scatter_mapbox(
average_docking_station_movements,
lat='latitude',
lon='longitude',
color='daily_surplus',
hover_data=['docking_station_name'],
size='n_docking_points',
size_max=15,
color_continuous_scale='RdYlGn',
mapbox_style='dark',
range_color=[-20, 20],
opacity=0.9,
zoom=11,
height=700,
title='Average Daily Bicycle Surplus by Docking Station'
)
fig_average_daily_station_surplus.show()
This project illustrates how the TfL cycle hire network is used. Cycle Journey Data was taken from Transport for London’s Open Data portal, where extracts are available for more than a decade’s worth of journeys on the TfL cycle network. A script is presented that allows the user to automatically download, clean and compile any subset of the Journey Extract data from 2015 to the current year.
Key findings:
- Hire duration: ~86% of journeys are shorter than 30 minutes, with the median journey taking 15 minutes.
- Journey distance: Subject to simplifying assumptions, the vast majority of cycle hires are under 4 miles long.
- Journey speed: The median journey has an overall average speed of ~8mph, with weekend journeys generally being slightly slower.
- Popular journeys: Journeys completed over the weekend differ significantly from those in the week. Weekday trips generally travel from commuter stations into the city centre, whereas weekend traffic is concentrated around recreational hotspots.
- Commuter use: There are clear commuter patterns in the data, with rush-hour traffic appearing around 8am and 6pm during the week. The analysis found sufficient evidence to show that Cycle Hire is widely used, not only for recreation, but as an alternative to other modes of commuter transport in the city.
- Tooling: The automated load, enrichment and validation script is capable of downloading 7 years' worth of Journey Extract Data from the TfL Open Cycling Data Portal. It contains validation and data cleaning processes to ensure ease of use for any future analysis.
The data enrichment section used OSRM to find realistic route information between stations.
The script loaded a sample of 1.5 million journeys from June and July 2022; after joining route geometries and filtering implausible speeds, 981,603 journeys remain in the analysis set.
len(journey_df)
981603
journey_df.head()
| bike_id | end_station_id | start_station_id | duration_s | rental_id | end_date | start_date | is_weekend | duration_min | station_pair | geometry | distance_m | distance_mi | speed_mph | date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18241 | 500 | 542 | 420 | 121447141 | 2022-06-16 14:14:00 | 2022-06-16 14:07:00 | False | 7.0 | (542, 500) | LINESTRING (-0.03376 51.51405, -0.03406 51.514... | 1996.817814 | 1.240768 | 10.635147 | 2022-06-16 |
| 11327 | 5642 | 500 | 542 | 420 | 121597697 | 2022-06-20 08:01:00 | 2022-06-20 07:54:00 | False | 7.0 | (542, 500) | LINESTRING (-0.03376 51.51405, -0.03406 51.514... | 1996.817814 | 1.240768 | 10.635147 | 2022-06-20 |
| 11328 | 8835 | 500 | 542 | 420 | 121455637 | 2022-06-16 17:25:00 | 2022-06-16 17:18:00 | False | 7.0 | (542, 500) | LINESTRING (-0.03376 51.51405, -0.03406 51.514... | 1996.817814 | 1.240768 | 10.635147 | 2022-06-16 |
| 26965 | 18328 | 500 | 542 | 420 | 121497723 | 2022-06-17 15:38:00 | 2022-06-17 15:31:00 | False | 7.0 | (542, 500) | LINESTRING (-0.03376 51.51405, -0.03406 51.514... | 1996.817814 | 1.240768 | 10.635147 | 2022-06-17 |
| 64930 | 14907 | 500 | 542 | 480 | 122241591 | 2022-07-04 09:22:00 | 2022-07-04 09:14:00 | False | 8.0 | (542, 500) | LINESTRING (-0.03376 51.51405, -0.03406 51.514... | 1996.817814 | 1.240768 | 9.305753 | 2022-07-04 |
Together with the journeys, information on ~800 docking stations was also included in the analysis.
len(foir_station_df)
797
foir_station_df.head()
| docking_station_name | n_docking_points | latitude | longitude | |
|---|---|---|---|---|
| foir_docking_station_id | ||||
| 1 | River Street, Clerkenwell | 19 | 51.5292 | -0.109971 |
| 2 | Phillimore Gardens, Kensington | 37 | 51.4996 | -0.197574 |
| 3 | Christopher Street, Liverpool Street | 32 | 51.5213 | -0.084606 |
| 4 | St. Chad's Street, King's Cross | 23 | 51.5301 | -0.120974 |
| 5 | Sedding Street, Sloane Square | 27 | 51.4931 | -0.156876 |
plot_docking_stations(foir_station_df, zoom=10.5, title='Cycle Hire Docking Stations')
During the data enrichment phase, routes for tens of thousands of popular journeys were generated using OSRM. The plot below shows the route geometry coverage over London.
len(station_id_cross_with_geometry)
74201
station_id_cross_with_geometry.plot()
<AxesSubplot: >
By combining Journey and Station data, the analysis is able to provide a visual understanding of how Londoners use the Cycle Hire Network.
The analysis started by showing the distribution of cycle hire duration, and illustrated that ~86% of journeys are completed in under 30 minutes. It’s also clear that a steep gradient exists between minute 20 and minute 30 of the duration distribution. Both of these observations are to be expected in light of the hire scheme’s fee structure, which (as of June 2022, when the journey data extract was taken) charged an additional £2 for every 30-minute period beyond the first. Unsurprisingly, the duration of cycle hire journeys shifts upward over the weekend, as users are more likely to hire cycles for recreation.
plot_cycle_hire_duration_distribution()
The analysis further applied the Open Source Routing Machine, together with the strong simplifying assumption that all journeys take the recommended route between the start and end station. This allows analysis of cycle hire distance. It was found that most cycle journeys fall in the ½-3 mile range. It is, however, surprising that Cycle Journey distance for trips during the weekend is not significantly different from those trips completed during the week (despite there being a marked difference between hire durations).
plot_cycle_hire_distance_distribution()
The incorporation of distance, together with journey duration, allows the calculation of the average trip speed. Assuming that bicycles take the recommended route between stations, the majority of cycle journeys average between 5-15mph over the journey. Notably, average cycle speed has a significantly lower distribution for weekend trips than for weekday trips. Given that there is no significant difference between assumed trip distances on the weekend versus weekdays, this suggests that cyclists ride faster, and are perhaps more rushed, during the week than on the weekend.
plot_cycle_journey_speed_distribution()
print(f"The median journey's average speed: {round(journey_df.speed_mph.median(), 2)} mph")
The median journey's average speed: 8.16 mph
By using the OSRM-recommended cycle routes, a plot was derived illustrating the most popular cycle routes in the city, together with the busiest stations. Unsurprisingly, the popular routes follow good cycle infrastructure: in and around Hyde Park, along the Thames into the Square Mile, and in the Olympic Park.
Chart Description
Assumed cycle routes are plotted, varying the stroke width by the popularity of the route. Every line is drawn with some transparency, so areas where many routes overlap appear with a stronger yellow hue. Docking stations are also shown as circular markers, with their sizes and colours dependent on the popularity of the routes from that station. Yellow colours translate to more popular stations.
Interactive Route Popularity Chart link: ./Report Output/Route Popularity Visualisation.html

The analysis thus far made it clear that Cycle Hire is popular for recreational use (with hires in the royal parks and around the "touristy parts" of central London). To establish whether the TfL Cycle Hire is widely used by commuters, a temporal analysis was done.
plot_journeys_started_weekend_vs_weekday()
By plotting journey start dates over time, and comparing weekday to weekend hire start times, a clear commuter pattern emerges.
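The comparison boils down to grouping hire start times by hour of day, split on weekday versus weekend. A sketch with pandas on toy data (the `start_date` column name is an assumption about the journey extract's schema):

```python
import pandas as pd

# Toy journeys: two on a Monday, two at the weekend (dates chosen for illustration)
journeys = pd.DataFrame({
    "start_date": pd.to_datetime([
        "2022-06-06 08:15", "2022-06-06 17:40",   # Monday: rush-hour peaks
        "2022-06-11 14:05", "2022-06-12 15:30",   # Saturday/Sunday: midday
    ])
})
journeys["is_weekend"] = journeys.start_date.dt.dayofweek >= 5
journeys["hour"] = journeys.start_date.dt.hour

# Count of hires started per hour, split by weekday vs weekend
starts_by_hour = journeys.groupby(["is_weekend", "hour"]).size()
```

On the real data, the weekday series shows twin peaks at the morning and evening rush hours, while the weekend series forms a single broad midday hump.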
The next question to lead from this is where these trips are going. To answer this, lines from the most popular routes were plotted over a map, varying their width by the number of journeys taking that route, and their colour by the average duration of the route.
Weekday routes were compared to weekend routes.
During the week, the main route activity is from Waterloo, King's Cross and Euston stations into the City Centre. Over the weekend, the main routes of activity change significantly. Neither the commuter train stations (Waterloo, Euston and King's Cross), nor the Square Mile seem to feature. Instead, activity is based around the Thames (particularly the South Bank), Hyde Park, and some longer journeys between Camden and the South Bank.
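The aggregation behind these arc plots amounts to grouping journeys by origin-destination pair, counting trips (arc width) and averaging duration (arc colour). A minimal sketch; the station names and column labels here are made up for illustration:

```python
import pandas as pd

journeys = pd.DataFrame({
    "start_station": ["Waterloo", "Waterloo", "Hyde Park Corner"],
    "end_station":   ["Bank", "Bank", "Albert Gate"],
    "duration_s":    [780, 840, 600],
})

route_stats = journeys.groupby(["start_station", "end_station"]).agg(
    trips=("duration_s", "size"),           # drives arc width
    avg_duration_s=("duration_s", "mean"),  # drives arc colour
)
```

Filtering this table to weekday-only and weekend-only journeys before aggregating yields the two contrasting maps.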
Chart Description
Weekday journeys are plotted by drawing an arc from the location of the starting docking station, to that of the destination docking point.
Thicker arcs indicate that more users make that particular journey. The hues indicate the journey duration. Red hues indicate longer trips.
Interactive HTML Weekday Journey Plot link: ./Report Output/Weekday Journeys Arc Visualisation.html

Chart Description
Weekend journeys are plotted by drawing an arc from the location of the starting docking station, to that of the docking station to which the bicycle was returned.
Thicker arcs indicate that more users make that particular journey. The hues indicate the journey duration. Red hues indicate longer trips.
Interactive HTML Weekend Journey Plot link: ./Report Output/Weekend Journeys Arc Visualisation.html

Finally, the project aimed to provide the user with a feeling for what a day’s Cycle Hire journeys in the capital might look like. To do this, Kepler’s trip plotting library was used. Starting early in the morning, there are low levels of activity, with the number of cycles rising sharply during rush hour.
Chart Description
Cycle journeys are plotted as points moving with a disappearing trail. Open the interactive HTML chart in a new tab (preferably in Chrome) and press the ▶️ button to see cycle movement throughout the day.
Interactive HTML Day's Trips Plot link: ./Report Output/Single Day Trip Visualisation.html
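For reference, kepler.gl's trip layer animates GeoJSON features whose coordinates carry a Unix timestamp as a fourth element; that timestamp is how the player knows when each point along a journey is reached. A minimal sketch of building one such feature (the helper name and coordinates are mine):

```python
def trip_feature(points, properties=None):
    """GeoJSON LineString in the shape kepler.gl's trip layer animates:
    each coordinate is [longitude, latitude, altitude, unix_timestamp]."""
    return {
        "type": "Feature",
        "properties": properties or {},
        "geometry": {"type": "LineString", "coordinates": points},
    }

feature = trip_feature([
    [-0.120, 51.503, 0, 1656662400],  # start of the journey
    [-0.116, 51.505, 0, 1656662700],  # five minutes later
])
```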

The idealistic view of the cycle hire scheme is that it would be self-balancing. This would mean that, for every user renting a cycle from a particular station, another user returns one. A perfect self-balancing network would require minimal intervention, and consequently have a lower logistical overhead. This project attempted to ascertain the extent to which the TfL cycle hire scheme is self-balancing. By analysing rental and return numbers by station, a daily bicycle surplus could be established.
Daily Bicycle Surplus: The number of cycles returned to a docking station, minus the number rented from that station in a day.
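In pandas terms, this definition amounts to counting returns and rentals per station and subtracting. A minimal sketch with made-up station names and column labels:

```python
import pandas as pd

# One day of toy journeys between three hypothetical stations
journeys = pd.DataFrame({
    "start_station": ["A", "A", "B", "C"],
    "end_station":   ["B", "B", "A", "A"],
})

rentals = journeys.groupby("start_station").size()
returns = journeys.groupby("end_station").size()

# Positive surplus: more returns than rentals (bicycles pile up here);
# fill_value=0 covers stations that saw only rentals or only returns.
daily_surplus = returns.sub(rentals, fill_value=0)
```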
plot_bicycle_surplus_distribution()
print(f"On average, for every docking station on the network, {round(average_daily_surpluse, 2)} bicycles need to be manually moved to another station at the end of the day to balance out the network.")
On average, for every docking station on the network, 3.09 bicycles need to be manually moved to another station at the end of the day to balance out the network.
The analysis found that, on average, around 3 bicycles per docking station need to be moved to another station every day to perfectly rebalance the network.
Despite the relatively small average, there are some docking points that contribute to a significant imbalance in the network, as was clear from the spatial analysis.
By visual inspection, the majority of surplus bicycles are left in the city centre, with most stations running a deficit lying further out of the city - particularly towards the (hilly) North.
Chart Description
The chart below plots every docking station on the network. Marker size is varied by the number of docking points available at that station. Stations with more docking points appear as larger markers on the plot. Marker colour is varied by the average daily surplus of bicycles returned to that station. Stations experiencing a net loss of bicycles during a day (on average) are shown with a reddish hue. Stations with a large surplus of bicycles (i.e. more returns than rentals) at the end of the day appear with a green hue.
fig_average_daily_station_surplus.show()
There are also some interesting individual cases. For example, the docking station at “Hop Exchange”, which is close to Borough Market, sees the largest average daily surplus of any station on the network. Cyclists seem to like cycling to the market, but not cycling back from it. This might be because the station is popular with consumers, who might not want to cycle back with their fresh groceries.

Similarly, stations in the West End and on the South Bank typically see a large surplus of cycles. This possibly indicates that cycle users are keen to cycle into the city, but prefer to take another method of transport out of the city.

This project was able to:
The most obvious limitation of the report is that GPS traces from the individual bicycles were not available to use. Therefore, an assumption was made on the route taken between stations in order to allow inferring journey distance and speed, as well as plots illustrating cycle routes and route popularity through the capital.
The study was also limited by data quality issues in the FOI Station Data. Six docking stations had to be removed entirely due to these issues. The author has notified TfL of the duplicate station IDs; if an improved station dataset becomes available, a more complete analysis would be possible.
There are a few avenues for future work that could leverage the analysis done in this project.
Investigate Annual Trends
The data load script is capable of loading Journey Extract Data from any period from 2015 onwards. A future analysis could generate a report for different periods in time and assess the way in which the network - and its utility - is changing.
Assess the Effects of New Infrastructure
For example, new docking stations are continually added to the network. It would be interesting to see how new docking stations affect the balance of the system, as well as the way in which the network is used.
Assess the Effects of New Bicycle Types
TfL recently introduced electric bicycles onto the cycle hire network. It would be interesting to see how journeys on these compare to those on the regular cycles. Do they increase the average hire duration or the average journey distance? Would they improve balance on the network by allowing journeys that would’ve otherwise been too strenuous?
Analyse Urban Regeneration Trends
The Olympic Park shows significant activity on the Cycle Network. This is a public area of London that did not exist in its current form before 2012. Similarly, it would be interesting to see how newly regenerated areas attract activity from other parts of London. For example, one can expect to see an uptick in activity around the new Battersea Power Station development in future datasets. This analysis could be used to ascertain the success of new public spaces and regeneration projects.
Future Infrastructure Proposal
Another avenue for future work could focus on the need for future infrastructure. Which docking stations are severely over-burdened? Are there changes that could be made to the docking station network to improve its balance? Would increasing docking station capacity provide a greater buffer against imbalance?